Abstract

Priority queues are data structures which store keys in an ordered fashion to allow efficient access to 
the minimal (maximal) key. Priority queues are essential for many applications, e.g., Dijkstra’s
single-source shortest path algorithm, branch-and-bound algorithms, and prioritized schedulers.

Efficient multiprocessor computing requires implementations of basic data structures that can be used 
concurrently and scale to large numbers of threads and cores. Lock-free data structures promise superior 
scalability by avoiding blocking synchronization primitives, but the delete-min operation is an inherent 
scalability bottleneck in concurrent priority queues. Recent work has focused on alleviating this obstacle 
either by batching operations or by relaxing the requirements on the delete-min operation.

We present a new, lock-free priority queue that relaxes the delete-min operation so that it is allowed 
to delete any of the p + 1 smallest keys, where p is a runtime configurable parameter. Additionally, the 
behavior is identical to a non-relaxed priority queue for items added and removed by the same thread. 

The priority queue is built from a logarithmic number of sorted arrays in a way similar to log-structured 
merge-trees. We experimentally compare our priority queue to recent state-of-the-art lock-free priority 
queues, both with relaxed and non-relaxed semantics, showing high performance and good scalability of 
our approach. 

Keywords Task-parallel programming, priority-queue, concurrent data structure relaxation, parallel
single-source shortest path

1 Introduction 

A priority queue is a data structure for maintaining a set of keys (potentially stored along with some data) 
such that the minimum (or maximum) key can be efficiently accessed and removed. In its simplest form, a 
priority queue supports insertion (insert) of new keys, and finding and deleting a minimal key (delete-min).
Priority queues may additionally support deletion of arbitrary keys and decreasing (or increasing) the value
of a key, as well as operations to meld and split queues. Priority queue operations can generally be performed
in O(log n) time, some in O(1) time, but either insert or delete-min must take Ω(log n) time [25].

Parallel and concurrent priority queues have been the subject of research since the 1980s. While early
efforts have focused mostly on parallelizing heap structures [12], recent implementations of priority
queues were often based on SkipLists [20]. The randomized priority queue [15] is based on a tree of
sorted arrays.

*This paper is a full version of a poster with the same title presented at PPoPP’15.

It is commonly believed that lock- and wait-free data structures provide the best scalability for
multiprogrammed environments, because stalling threads cannot block progress of (all) other threads. However,
for priority queues, the delete-min operation remains an inherently sequential scalability bottleneck. Recent
approaches attempt to alleviate this obstacle by batching and elimination [5], or by relaxing the linearization
requirements for insertions [26, 29] and deletions [2]. A very appealing, randomized lock-based approach was
recently described in [21].

In this paper we present a new, lock-free concurrent priority queue built from a logarithmic number of
sorted arrays of keys, similar to the log-structured merge-trees used in databases [18]. It relaxes linearization
requirements on insertions, thus allowing each thread to batch together up to k insert operations before it
is required to linearize them with respect to insertions by other threads. The delete-min operation is also
relaxed so that it may delete and return any of the k + 1 smallest keys visible to all threads. The parameter k
can be configured at run-time, and can even be changed on a per-key basis.

Combining both relaxations, our k-LSM priority queue is allowed to ignore up to p = Tk keys inserted into
the priority queue at any time (T being the number of threads), but never more. Despite the relaxations, the
priority queue preserves local semantics per thread, so that keys inserted and deleted by the same thread will
always be deleted in the correct order. We provide a C++ implementation of this data structure. Experiments
show high single-thread performance, and very good scalability when choosing a reasonably large value for k.

The paper is structured as follows: we discuss general ordering semantics for concurrent data structures
in Section 2. In Section 3 we explain the sequential LSM algorithm used as the basis for our concurrent priority
queue. Our concurrent algorithm and its implementation are described in Section 4. Section 5 establishes
correctness and progress guarantees. Finally, we experimentally compare the k-LSM priority queue to other
state-of-the-art priority queue implementations using a synthetic benchmark, and a concurrent variant of
Dijkstra’s algorithm for single-source shortest paths.

2 Ordering semantics 

Ordering semantics provide trade-offs between scalability and linearizability guarantees on update operations 
(insert and delete-min) as well as read operations (find-min).

Global ordering semantics The strictest possible semantics for concurrent ordered containers is to 
linearize all update operations by all threads with regard to each other. For priority queues this will typically 
lead to high contention on concurrent delete-min operations, since all threads will attempt to remove the 
same item, and only one can succeed. 

Local ordering semantics The other end of the spectrum occurs when threads maintain their own, local 
copy of the data structure. When a thread accesses the thread-local copy of another thread, its operations 
are linearized with regard to operations of other threads accessing the same copy. Operations on distinct 
thread-local data structures are not linearized with regard to each other, and no global guarantees can be 
given. Purely local semantics can be found, e.g., in work-stealing deques [3]. 

Quantitative or p-relaxation Afek et al. [1] introduced an alternative consistency model called quasi
linearizability, which extends linearizability by allowing operations to occur out of a correct linearizable
order. Quasi linearizability imposes an upper bound on the distance each operation is allowed to have from
a correct linearizable operation order. Quasi linearizability has been used as a consistency condition for
FIFO queues, but is not restricted to these. Later work by Henzinger et al. [10] provided a closely
related model called quantitative relaxation.

In previous work [26, 22], we introduced the term p-relaxation to describe quantitative relaxation on
data structures, where an upper bound, p, can be given on the number of items that can be skipped on data
structure accesses. We say an item is skipped whenever an operation on the data structure is supposed to
return an item according to global ordering semantics, but instead returns the item it would return if the
other item did not exist. This includes the case where a null-value is returned by a data structure, making
it look empty, since all stored items were skipped.

Figure 1: Temporal vs. structural p-relaxation on a stack for p = 2. The items that can be relaxed by
structural p-relaxation are marked with p_s, by temporal with p_t. After the pop, temporal p-relaxation is
not allowed to skip B, since two items were added to the stack after B, even though C was deleted in the
meantime.

We distinguish two types of p-relaxation: temporal and structural, as shown in Figure 1. Temporal
p-relaxation is based on the recency of items, and closely related to quasi linearizability. A temporally p-relaxed
data structure is only allowed to skip the p most recently added items. Temporal p-relaxation can be applied
globally, so that only the p most recent items added by any thread can be skipped, or locally, where the p
most recent items by each thread can be skipped. In contrast, structural p-relaxation is only concerned with
the number of items that are allowed to be skipped at any point in time regardless of their recency.

Note that in a p-relaxed data structure all items still need to be accessible by all threads at any point
in time after their insertion to keep the p-relaxation guarantees in case of stalling threads. The relaxation
only permits an operation to deliberately ignore a bounded number of items to improve scalability.
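
The structural bound can be made concrete with a small reference check. The sketch below tests whether a key returned by a relaxed delete-min is permitted, i.e. whether removing it skips at most p smaller keys. The helper name `within_structural_relaxation` and the use of `std::multiset` as a sequential reference model are assumptions made for illustration; they are not taken from the implementation described in this paper.

```cpp
#include <cstddef>
#include <iterator>
#include <set>

// Returns true if `returned` is stored in `contents` and removing it skips
// at most `p` strictly smaller keys, i.e. it is one of the p + 1 smallest.
// Hypothetical helper for illustration only.
inline bool within_structural_relaxation(const std::multiset<int>& contents,
                                         int returned, std::size_t p) {
    auto first = contents.lower_bound(returned);
    if (first == contents.end() || *first != returned) return false;
    // Number of keys strictly smaller than `returned`.
    std::size_t skipped =
        static_cast<std::size_t>(std::distance(contents.begin(), first));
    return skipped <= p;
}
```

With contents {1, 3, 5, 7, 9} and p = 2, returning 5 is allowed (two keys skipped), while returning 7 is not.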


3 Log-structured Merge-tree 


We now present our new priority queue based on log-structured merge-trees (LSM). Log-structured
merge-trees [18] were introduced in the database community, where they are used as a disk-based index structure.
They consist of a logarithmic number of sorted arrays and provide high efficiency for applications with large
amounts of item retrievals and removals and rare lookups. The data structure is appealing for priority queue
implementations because finding the minimum (maximum) item can be done much faster than other
lookups. Furthermore, because the data structure is designed to reduce the number of disk accesses and has
a low number of non-contiguous disk accesses, it can likewise be implemented in a cache-efficient manner.¹
A log-structured merge-tree (LSM) priority queue operates on a logarithmic number of sorted arrays,
which we call blocks, as depicted in Figure 2(a). Each block is assigned a level, and a block with level l can
store a sorted array of n keys, where $2^{l-1} < n \le 2^l$. To guarantee logarithmic lookup time for retrieving a
minimal key, the LSM allows at most one block of each level at any point in time. If this property is violated
by insertion or deletion of a key, the two blocks with the same level will be merged into one block with the
next-higher level repeatedly until the property is fulfilled again.

An insertion is shown in Figure 2. A single-key block at level 0 is first created and appended to the
LSM. Merges are then performed from the end of the list until no two blocks with the same level exist. The
find-min operation returns the minimal key in the LSM. Since blocks are sorted, the minimal key of each
block can be accessed in constant time, and the minimum over all minimal block keys must be returned. The
delete-min operation works similarly, but also removes the minimal key from the LSM. If a block contains
too few elements for its level after removal, it is shrunk to the next-smaller level and merged with another
block of the same level if necessary. Thus, the amortized complexity of all operations is O(log n).
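
The sequential algorithm just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: the class name `SeqLSM`, the use of `std::vector<int>` blocks, and the `num_blocks` accessor are all hypothetical. Blocks are kept sorted in decreasing order so that the minimum of each block is its last element.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <utility>
#include <vector>

// Sequential LSM sketch. A block of level l holds n keys with
// 2^(l-1) < n <= 2^l, and block levels are strictly decreasing.
class SeqLSM {
    std::vector<std::vector<int>> blocks_;

    static int level_of(std::size_t n) {
        int l = 0;
        while ((std::size_t{1} << l) < n) ++l;
        return l;
    }

    // Two-way merge of blocks i and i + 1 into a single block at position i.
    void merge_with_next(std::size_t i) {
        std::vector<int> out;
        out.reserve(blocks_[i].size() + blocks_[i + 1].size());
        std::merge(blocks_[i].begin(), blocks_[i].end(),
                   blocks_[i + 1].begin(), blocks_[i + 1].end(),
                   std::back_inserter(out), std::greater<int>());
        blocks_[i] = std::move(out);
        blocks_.erase(blocks_.begin() + static_cast<std::ptrdiff_t>(i) + 1);
    }

    bool last_two_share_level() const {
        std::size_t n = blocks_.size();
        return n >= 2 && level_of(blocks_[n - 2].size()) ==
                             level_of(blocks_[n - 1].size());
    }

public:
    std::size_t num_blocks() const { return blocks_.size(); }

    void insert(int key) {
        blocks_.push_back({key});  // new level-0 block at the end
        while (last_two_share_level()) merge_with_next(blocks_.size() - 2);
    }

    bool try_delete_min(int& key) {
        if (blocks_.empty()) return false;
        // The overall minimum is the smallest of the per-block minima.
        std::size_t best = 0;
        for (std::size_t i = 1; i < blocks_.size(); ++i)
            if (blocks_[i].back() < blocks_[best].back()) best = i;
        key = blocks_[best].back();
        blocks_[best].pop_back();
        if (blocks_[best].empty()) {
            blocks_.erase(blocks_.begin() + static_cast<std::ptrdiff_t>(best));
        } else if (best + 1 < blocks_.size() &&
                   level_of(blocks_[best].size()) ==
                       level_of(blocks_[best + 1].size())) {
            merge_with_next(best);  // block shrank to its successor's level
        }
        return true;
    }
};
```

Inserting seven keys leaves three blocks of sizes 4, 2 and 1, mirroring the binary representation of the number of stored keys.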


¹ The LSM presented here was invented independently, based on requirements for concurrent relaxed priority queues; only
later did we discover the relation to the data structure used in the database community.
Figure 2: A log-structured merge-tree (LSM). Panels: (a) a log-structured merge-tree (LSM); (b) adding a
new item to an LSM; (c) after the first merge; (d) after the second merge.


4 The k-LSM priority queue


The k-LSM priority queue consists of two separate components, the shared k-LSM and the distributed LSM
priority queue. Both can also be used as a standalone priority queue, but each has drawbacks that make it
beneficial to combine it with the other.

The shared k-LSM priority queue, described in Section 4.1, can provide good guarantees on the ordering
semantics, but suffers from a sequential bottleneck on insertions and various maintenance operations. The
frequency of both types of operations can be reduced by roughly a factor of k by inserting sorted blocks of
k keys instead of single keys. We use the distributed LSM priority queue for batching keys together.

The distributed LSM priority queue, described in Section 4.2, does not suffer from these sequential
bottlenecks, because synchronization is kept low. On the other hand, it has only local ordering semantics
and is also wasteful in memory usage. To achieve stronger ordering guarantees and good space bounds
we combine it with the shared k-LSM priority queue, and bound the size of the distributed LSM by the
parameter k. Keys from the distributed LSM are transferred to the shared k-LSM priority queue whenever
the bounds are reached, as described in Section 4.3. We first assume that a garbage collector is available,
but describe in Section 4.4 how memory management can be done manually on systems without garbage
collection. Extensions to the data structure are discussed in Section 4.5.


External interface The external interface of our priority queue consists of two functions: insert and 
try_delete_min. The insert function inserts a single key into the priority queue and always succeeds; 
try_delete_min attempts to find the minimal key (according to the ordering semantics) and delete it. This 
function returns a flag indicating whether the operation succeeded (a key was found and deleted) and the 
minimal key (on success). The operation will fail if the priority queue is empty, and may spuriously fail as 
long as it is guaranteed that a key will eventually be returned given enough attempts. Implementations of 
both functions are shown in Section 4.3.

The interface can be extended by an analogous try_find_min function, and a size function. The latter
returns the number of keys in the priority queue and is allowed to be off by up to p from the actual value,
where p is the relaxation parameter explained in Section 5.

Shared components Each key in the priority queue is wrapped in a small Item structure. The blocks of 
an LSM consist of pointers to Item instances, and more than one pointer to an Item is allowed to exist. 

An Item stores the key alongside a boolean flag. Items are logically deleted by performing an atomic
test_and_set on flag, wrapped in the member function Item::take. Pointers to logically deleted Item
instances are lazily cleaned out of blocks whenever blocks are resized or merged.
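
A minimal sketch of such an Item is given below. Realizing the test-and-set with `std::atomic<bool>::exchange`, the chosen memory orderings, and the helper `is_deleted` are assumptions made for this illustration; the text only fixes that `take` wraps an atomic test_and_set on the flag.

```cpp
#include <atomic>

// Key plus logical-deletion flag (sketch; layout is an assumption).
template <typename Key>
struct Item {
    Key key;
    std::atomic<bool> flag{false};  // true once logically deleted

    explicit Item(Key k) : key(k) {}

    // Atomically sets the flag. Returns true only for the single caller
    // that changed it from false to true, i.e. the thread that owns the
    // deletion.
    bool take() { return !flag.exchange(true, std::memory_order_acq_rel); }

    bool is_deleted() const { return flag.load(std::memory_order_acquire); }
};
```

Because `exchange` returns the previous value, at most one of several racing `take` calls on the same item can succeed, which is what allows several pointers to one Item to coexist in different blocks.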
Listing 1 Pseudocode for the Block data structure, which stores a sorted array of Item pointers.

struct Block {
    atomic<int> filled;
    int level;
    atomic<Item*>* items;

    Block(int l) : filled(0), level(l) {
        items = new atomic<Item*>[1 << l];  // capacity 2^l
    }

    void append(Item* item) {
        // Only copy items that are not logically deleted
        if (!item->flag) items[filled++] = item;
    }

    Block* copy(int l) {
        Block* nb = new Block(l);
        for (int i = 0; i < filled; ++i) nb->append(items[i]);
        return nb;
    }

    void merge_in(Block* b1, Block* b2); // not shown

    Block* shrink() {
        int f = filled;
        // Skip items at the end that have been logically deleted
        while (f > 0 && items[f - 1]->flag) --f;

        // Check whether we still use the correct level:
        // a block of level l must hold more than 2^(l-1) items
        int l = level;
        while (l > 0 && f <= (1 << (l - 1))) --l;

        // It is necessary to shrink the block
        if (l < level) {
            // Recurse, since copy may clean out items
            return copy(l)->shrink();
        }
        filled = f;
        return this;
    }
};
The Block data structure stores items in decreasing key order and is shown in Listing 1. The level of a
block is the base-2 logarithm of the size of its Item array. The member variable filled stores the actual
number of items in the block. The append method adds a single item to the end of the block, but only if
it has not been logically deleted. The copy method copies the block into a new block of given level, and
uses the append method to filter out all logically deleted items. The merge_in method (code not shown)
performs a standard two-way merge of the given blocks and stores the result using append. Finally, the
shrink method scans the end of the block for logically deleted items, decrements filled and if necessary
copies the items into a smaller block.
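
Since the code for merge_in is not shown, the following sketch illustrates what a two-way merge with filtering of deleted items might look like. It is an assumption-laden stand-in: it works on `std::vector<Item*>` instead of Block, the `Item` type is a minimal placeholder, and the function name `merge_blocks` is hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct Item {                       // minimal placeholder for Item
    int key;
    std::atomic<bool> flag{false};  // true once logically deleted
    explicit Item(int k) : key(k) {}
};

// Merges two blocks sorted by decreasing key into one block in decreasing
// order, dropping pointers to logically deleted items (the role that
// `append` plays in Listing 1).
inline std::vector<Item*> merge_blocks(const std::vector<Item*>& a,
                                       const std::vector<Item*>& b) {
    std::vector<Item*> out;
    out.reserve(a.size() + b.size());
    auto append = [&out](Item* it) {
        if (!it->flag.load()) out.push_back(it);
    };
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        // Emitting the larger key first keeps the result decreasing.
        if (a[i]->key >= b[j]->key) append(a[i++]);
        else append(b[j++]);
    }
    while (i < a.size()) append(a[i++]);
    while (j < b.size()) append(b[j++]);
    return out;
}
```

Filtering during the merge means the result may be smaller than the sum of the input sizes, which is why a merged block is shrunk afterwards.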

4.1 Shared k-LSM priority queue

The shared k-LSM priority queue uses a fairly straightforward parallelization idea: Blocks in the LSM are
stored as an array of pointers to blocks, sorted by level. This array is accessible to all threads by a single 
pointer. Each update to the priority queue will atomically replace the pointer to the array by a pointer to 
a new array containing all modifications using a compare-and-swap. 

This copy-on-write mechanism is safe since blocks are not modified once they have been added to the 
array with the exception of the member variable filled, which is modified in the shrink method. While 
there can be data races on accesses to filled, all possible outcomes will lead to the block being in a valid 
state, since the values being written to filled may only be larger than the correct value, but never smaller. 
Merging and shrinking will always modify new blocks and not modify an existing block shared with other 
threads. 
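
The copy-on-write update can be sketched with `std::atomic` as follows. The `Snapshot` type, the `cow_update` helper and the retry loop are illustrative assumptions; the sketch also leaks replaced snapshots on purpose, because safe reclamation without a garbage collector is a separate problem (Section 4.4).

```cpp
#include <atomic>
#include <vector>

// Hypothetical stand-in for the shared array of block pointers.
struct Snapshot {
    std::vector<int> levels;
};

// Copy the current array, modify the private copy, then try to publish it
// with a single compare-and-swap. On conflict, retry on the newer array.
template <typename Modify>
void cow_update(std::atomic<Snapshot*>& shared, Modify modify) {
    Snapshot* observed = shared.load(std::memory_order_acquire);
    for (;;) {
        Snapshot* next = observed ? new Snapshot(*observed) : new Snapshot();
        modify(*next);
        if (shared.compare_exchange_strong(observed, next,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
            return;  // published; the old array is intentionally not freed
        }
        // The copy never became visible to other threads, so it can be
        // deleted directly. `observed` now holds the current shared value.
        delete next;
    }
}
```

Readers that loaded the old pointer keep working on an unmodified array, which is the property the paragraph above relies on.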


Bottlenecks There are two main sequential bottlenecks in this implementation: updating the shared block
array, and deleting the minimal item. Updates to the block array can be reduced by doing bulk insertions.
As a side effect this leads to the average block being larger, thus also reducing the frequency of shrinking
and merging operations. We achieve bulk insertions by batching together inserted items using the distributed
LSM priority queue presented in Section 4.2.


The other sequential bottleneck of the shared k-LSM priority queue, deleting a minimal item, is inherent
to priority queues. We reduce its effects by relaxing a delete_min operation to take any of the k + 1 smallest
keys in the priority queue, instead of only the minimal one. The selection is performed uniformly at random,
and falls back to taking the minimal item in the same block if the selected item was already deleted by
another thread. In addition, a thread will always first search for its locally inserted minimal key and by
default delete this key if it is one of the k + 1 smallest keys. This ensures that the shared k-LSM priority
queue fulfills local ordering semantics, as described in Section 2.


Implementation The array, which is atomically replaced whenever a structural update to the LSM is
made, is stored in a data structure called BlockArray, as shown in Listing 2. This also contains an array of
pivot indices for each block, which separate keys guaranteed to be among the smallest k + 1 keys from other
keys.

The methods insert, consolidate and calculate_pivots modify a BlockArray and may thus only be 
called if it is not visible to other threads. The insert method adds a block to the BlockArray at its correct 
level position, and calls consolidate to ensure that the levels of blocks in the array are strictly decreasing. 

The consolidate method is responsible for shrinking blocks and performing merges if necessary. It 
performs two passes on the BlockArray. In the first pass, blocks are scanned beginning with the smallest 
block and shrunk if necessary. If a block after shrinking has a smaller level than its successor in the 
BlockArray it is merged with its successor. In the second pass, the array is compacted by getting rid of null 
pointers and empty blocks. The return value signifies whether a merge occurred during consolidation. 

The calculate_pivots method selects a pivot value, which is one of the k + 1 smallest keys in the LSM,
and writes the offset of the first key less than or equal to the pivot key in each block into the array k_pivots.
The copy method creates an identical copy of a BlockArray, by simply copying all data into a new instance.
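
One possible way to compute such pivot offsets is sketched below. It is not the paper's calculate_pivots: it operates on plain vectors of keys, always chooses the (k+1)-th smallest key as the pivot, and assumes distinct keys so that at most k + 1 keys lie at or below the pivot. All names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// For blocks sorted by decreasing key, returns for each block the offset of
// the first key <= pivot, where the pivot is the (k+1)-th smallest key
// overall (or the largest key if fewer than k + 1 keys exist).
inline std::vector<std::size_t> calculate_pivots(
        const std::vector<std::vector<int>>& blocks, std::size_t k) {
    std::vector<int> all;
    for (const auto& b : blocks) all.insert(all.end(), b.begin(), b.end());
    std::vector<std::size_t> pivots(blocks.size(), 0);
    if (all.empty()) return pivots;

    std::size_t idx = std::min(k, all.size() - 1);
    std::nth_element(all.begin(),
                     all.begin() + static_cast<std::ptrdiff_t>(idx),
                     all.end());
    int pivot = all[idx];

    for (std::size_t i = 0; i < blocks.size(); ++i) {
        // In a decreasing range, lower_bound with std::greater finds the
        // first element that is not greater than the pivot.
        auto it = std::lower_bound(blocks[i].begin(), blocks[i].end(), pivot,
                                   std::greater<int>());
        pivots[i] = static_cast<std::size_t>(it - blocks[i].begin());
    }
    return pivots;
}
```

For blocks {9, 6, 4, 1}, {8, 3} and {5} with k = 2 the pivot is 4, and the offsets 2, 1 and 1 leave exactly the three smallest keys 1, 3 and 4 as candidates.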
Listing 2 The struct for storing an array of blocks.

struct BlockArray {
    Block* blocks[MAX_LEVELS];
    int k_pivots[MAX_LEVELS];
    int size;

    void insert(Block* block);
    bool consolidate();
    void calculate_pivots();

    BlockArray* copy();

    Item* find_min() {
        int total = 0;

        // Find out how many elements we can choose from
        for (int i = 0; i < size; ++i) {
            total += blocks[i]->filled - k_pivots[i];
        }
        // We can assume that total > 0

        // Select one item uniformly at random
        int r = rand_int(0, total - 1);

        // Find block for r
        for (int i = 0; i < size; ++i) {
            int range = blocks[i]->filled - k_pivots[i];
            if (range <= r) r -= range;
            else {
                // Selected element is in this block
                if (r != range - 1) {
                    Item* item = blocks[i]->items[k_pivots[i] + r];
                    // Check whether not deleted
                    if (!item->flag) return item;
                }
                // Fall back to minimal element in block
                return blocks[i]->items[blocks[i]->filled - 1];
            }
        }
        return nullptr;  // not reached, since total > 0
    }
};





The find_min method randomly selects one out of the up to k items with smallest key. If that item is not 
marked as deleted it is returned, otherwise the algorithm falls back to the minimal item in the block. 


Listing 3 The shared k-LSM priority queue.

struct SharedKLSM {
    atomic<BlockArray*> shared;
    thread_local BlockArray* observed;
    thread_local BlockArray* snapshot;

    void refresh_snapshot() {
        observed = shared;
        // shared may be null if the queue is empty
        snapshot = (observed != nullptr) ? observed->copy() : nullptr;
    }

    bool push_snapshot() {
        return shared.cas(observed, snapshot);
    }

    void insert(Block* block); // not shown

    Item* find_min() {
        while (true) {
            if (shared != observed) refresh_snapshot();
            if (snapshot == nullptr) return nullptr;

            Item* item = snapshot->find_min();
            if (item->flag) {
                // Consolidate and push if merges occurred
                bool push = snapshot->consolidate();
                if (snapshot->size == 0) {
                    snapshot = nullptr;
                    push = true;
                }
                if (push) push_snapshot();
            } else return item;
        }
    }
};


The implementation of the shared k-LSM priority queue itself is found in Listing 3. It mainly consists of
a pointer to an instance of BlockArray called shared, which is shared between all threads, and two
thread-local pointers to BlockArray. Each thread uses the thread-local pointer snapshot to store a consistent
snapshot of shared. Since this snapshot is private to each thread, it is also used as a staging area to prepare
an update to shared. The thread-local pointer observed stores a pointer to the instance of shared that
was used to create the snapshot. All three pointers are initialized to null-pointers.

The refresh_snapshot method updates the snapshot by first copying the shared pointer to observed
and then storing a copy of the referenced BlockArray in snapshot. The push_snapshot method is responsible 
for updating the shared array after the private snapshot was modified. This operation may fail if the shared 
array was modified after the snapshot was taken. The insert method inserts a block into snapshot, 
consolidates, and then attempts to push the snapshot. Since this may fail due to the shared array being 
modified, it will retry on a new snapshot until it succeeds. 

The find_min method finds and returns the minimal key from the snapshot. If this key was already
marked as deleted by another thread, the array is consolidated. This may trigger a merge operation or result
in an empty snapshot. In both cases, an attempt is made to replace shared with the modified snapshot.
Failure to update shared means that another thread already succeeded in consolidating it and thus no retry
is necessary.

Local ordering semantics To simplify the presentation, the implementation shown here does not support 
local ordering semantics, which require that a thread will never skip items that it inserted itself. To support 
this we use a Bloom filter with each Block that identifies all threads that contributed items to the block. 
The find_min operation in BlockArray then checks the Bloom filter of each block for the current thread 
ID. The minimal key out of the blocks with the current thread ID in the Bloom filter is then compared with 
the key of the randomly selected item, and the smaller item returned. We use 64-bit Bloom filters with 
two hash-values obtained by tabular hashing. Since the Bloom filters are only updated when two blocks are 
merged, no synchronization mechanism is necessary (the block being merged into is not shared with other 
threads until the merge is completed and the Bloom filter updated). Our implementation used with the 
benchmarks supports local ordering semantics. 
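
A 64-bit Bloom filter over thread IDs with two hash values can be sketched as follows. The text states that tabular hashing is used; the two multiplicative hashes below are a simpler stand-in chosen only to keep the example short, and the class and method names are hypothetical.

```cpp
#include <cstdint>

// 64-bit Bloom filter recording which threads contributed items to a block.
class ThreadFilter {
    std::uint64_t bits_ = 0;

    // Two bit positions in [0, 63], derived from the top six bits of two
    // multiplicative hashes (stand-in for the hashing scheme in the text).
    static std::uint64_t mask(std::uint32_t tid) {
        std::uint64_t x = tid;
        unsigned h1 =
            static_cast<unsigned>((x * 0x9E3779B97F4A7C15ull) >> 58);
        unsigned h2 =
            static_cast<unsigned>(((x + 1) * 0xC2B2AE3D27D4EB4Full) >> 58);
        return (std::uint64_t{1} << h1) | (std::uint64_t{1} << h2);
    }

public:
    void add(std::uint32_t tid) { bits_ |= mask(tid); }

    // May report false positives, but never false negatives.
    bool may_contain(std::uint32_t tid) const {
        std::uint64_t m = mask(tid);
        return (bits_ & m) == m;
    }

    // Filter of a merged block: union of the contributors of both inputs.
    static ThreadFilter merged(const ThreadFilter& a, const ThreadFilter& b) {
        ThreadFilter out;
        out.bits_ = a.bits_ | b.bits_;
        return out;
    }
};
```

A false positive only causes a thread to inspect a block that contains none of its items, so it costs time but does not affect the local ordering guarantee.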

4.2 Distributed LSM priority queue 

The distributed LSM priority queue uses a parallelization idea inspired by work-stealing: each thread
maintains its own priority queue, and inserts and deletes keys from it. Only when its thread-local priority
queue is empty will a thread try to access keys stored in the priority queue of another thread. For this we
rely on a scheme called spying: a thread scans the priority queue of a randomly selected thread and copies the
pointers to items found in that priority queue into its own priority queue. Spying is similar to work-stealing,
the difference being that it is non-destructive: it does not remove items from the victim’s priority queue.
This is necessary to preserve local ordering semantics.

The main advantage of the distributed LSM priority queue is scalability: it is essentially embarrassingly
parallel, and the potential parallelism increases with the number of threads inserting items. The main 
disadvantage is its lack of global delete-min ordering guarantees. To provide ordering guarantees (when 
required), it is necessary to combine the distributed LSM data structure with some other data structure. 

Implementation In the distributed LSM priority queue, each thread has its own instance of the priority 
queue. A shared array storing pointers to all available instances of the priority queue is used for victim 
selection in the spying operation. 

The implementation of such a thread-local LSM priority queue is shown in Listing 4. It consists of an
array with pointers to blocks, and a size integer giving the number of blocks in the array. Blocks are stored 
in strictly decreasing block level order. The insert method creates a block of level 0 containing the new 
item. It then performs merges with other blocks in the LSM starting from the back until no block of same 
level is found. The old (non-merged) blocks stay accessible throughout the merge. The block containing all 
merged blocks is made visible replacing the old blocks once all merges have been performed. All items stay 
accessible throughout the merge, but may be encountered twice by spying threads. 

The consolidate method scans the block array for blocks needing to be shrunk, and performs block 
merges if necessary. It will only remove references to blocks being consolidated, after the consolidated blocks 
are made available. This ensures that a spy operation has the chance to find all items stored in the LSM 
of its victim at all times. The find_min method behaves similarly to its counterpart in the sequential LSM
data structure. 

The spy method scans through the blocks of the victim and copies the blocks it finds. Since the underlying 
array at the victim may be modified during the scan, care is taken only to copy blocks that will not violate 
the invariant that blocks are stored in strictly decreasing order. It is not necessary for the blocks obtained by 
spy to be a consistent snapshot, since the distributed LSM priority queue does not provide any guarantees 
on the ordering of items created by other threads. 

Listing 4 The distributed LSM priority queue.

struct DistLSM {
    atomic<Block*> blocks[MAX_LEVELS];
    int size;

    void insert(Item* item) {
        Block* b = new Block(0);
        b->append(item);

        int i = size;
        // Note: Merge is non-destructive. Old blocks stay
        // available throughout the loop
        while (i > 0 && blocks[i-1]->level <= b->level) {
            Block* b2 = new Block(b->level + 1);
            b2->merge_in(blocks[i-1], b);
            b = b2->shrink();
            --i;
        }
        // Only now replace largest of old blocks with block
        // containing all merged items
        blocks[i] = b;
        // Now redundant blocks become inaccessible
        size = i + 1;
    }

    void consolidate(); // not shown
    Item* find_min();

    bool spy(DistLSM* victim) {
        for (int i = 0; i < victim->size; ++i) {
            Block* b = victim->blocks[i];
            // Read the level only after the null check
            if (b != nullptr && (size == 0 ||
                    b->level < blocks[size-1]->level)) {
                blocks[size++] = b->copy(b->level);
            }
        }
        return size != 0;
    }
};

Space Efficiency To ensure better space bounds, spy can be restricted to never copy more than k items
(not shown in code for simplicity). Since spy can only be called if the thread-local LSM is empty, this will
ensure that no more than O(n + Tk) space is used, where n is the number of unique items in the distributed
LSM and T the number of threads. Note that such a restriction will implicitly happen for the combined
data structure described in the next section, since no blocks with a level larger than $\log_2 k - 1$ will exist to
spy from, and each spied block must have a level smaller than the previously spied block.

4.3 Combining both priority queues 

We now combine the shared k-LSM priority queue and the distributed LSM priority queue into a single data
structure, as shown in Listing 5. This priority queue consists of one distributed LSM priority queue per
thread, a single shared k-LSM priority queue, and a victim array used for spying.

The insert method for the combined data structure simply creates an item and inserts it into the 
distributed LSM priority queue using a modified version of its insert method. In this modified version, 
if after performing all necessary merges the resulting block is bigger than a certain size, the block will be 
inserted into the shared k-LSM priority queue instead of being added to the thread-local DistLSM.

In the delete_min method both priority queues are checked for the minimal item, and the minimum of 
both marked as deleted using an atomic test-and-set on the flag of the item returned. If the found item 
was already deleted by another thread, this process is repeated. If both are empty, spy is called and the 
operation repeated if spy was successful. 


Listing 5 The k-LSM priority queue.

struct KLSM {
    thread_local DistLSM dist;
    SharedKLSM shared;
    DistLSM* victims[NUM_THREADS];

    void insert(Key key) {
        Item* item = new Item(key);
        dist.insert(item, &shared);
    }

    bool delete_min(Key& key) {
        do {
            Item* item;

            do {
                item = dist.find_min();
                if (item == nullptr) item = shared.find_min();
                else {
                    Item* tmp = shared.find_min();
                    if (tmp != nullptr && tmp->key < item->key)
                        item = tmp;
                }
                // Try marking as deleted and repeat otherwise;
                // take() succeeds only for the thread that sets the flag
            } while (item != nullptr &&
                     (item->flag || !item->take()));

            if (item != nullptr) {
                key = item->key;
                return true;
            }

            // Try finding items at other threads
        } while (dist.spy(victims[rand(0, NUM_THREADS - 1)]));
        return false;
    }
};
4.4 Memory Management

So far we ignored the issue of memory management to simplify the presentation of the algorithm. In systems 
with garbage collection, we only have to clean out dangling pointers from the LSM arrays if filled is 
decremented to ensure garbage collection of the given items. 

In systems without garbage collection, like in our C++ implementations, we have to manage memory 
manually. We use the wait-free memory reuse scheme by Wimmer [2fi| to manage Item instances. Since the 
scheme is not ABA safe, we change the flag variable in Item to an integer, which allows items to be marked 
as deleted in an ABA-safe manner by incrementing flag with an atomic compare-and-swap. Blocks store 
the expected flag value together with each pointer to Item. 
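
The following is a minimal sketch of such a versioned flag, not the code of our implementation. The types Item and Entry and the functions make_entry and try_delete are hypothetical names; the sketch only assumes that a block entry records the flag value it observed and that deletion increments the flag with a compare-and-swap.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical reusable item: flag counts how often the slot was deleted.
struct Item {
    int key = 0;
    std::atomic<std::uint32_t> flag{0};
};

// A block entry remembers the flag value seen when the pointer was stored.
struct Entry {
    Item* item;
    std::uint32_t expected;
};

inline Entry make_entry(Item* item) {
    return Entry{item, item->flag.load(std::memory_order_acquire)};
}

// Succeeds only if the item is still the incarnation the entry refers to.
// A stale entry fails because the flag has moved on in the meantime.
inline bool try_delete(const Entry& e) {
    std::uint32_t expected = e.expected;
    return e.item->flag.compare_exchange_strong(
        expected, expected + 1, std::memory_order_acq_rel);
}
```

An entry that still refers to an earlier incarnation of a reused Item can therefore never delete the new incarnation by accident.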

It is guaranteed that no thread will need more than four instances of Block per level at any point in 
time, which will be allocated on first access. In spy, blocks need to be verified after copying all items to 
ensure ABA safety. 

Two instances of BlockArray per thread are sufficient. For ABA safety a version number is maintained 
for each instance. We allocate these instances aligned to 2048-Byte boundaries, allowing us to steal the ten 
least significant bits of a pointer to BlockArray, and work around the ABA problem by stamping the pointer 
with a truncated version number. The full version number is used for verification most of the time, and only 
a compare-and-swap on shared needs to rely on the truncated value. To reduce the risk of a wrap-around of 
the truncated value resulting in an ABA problem, the full version numbers are verified directly before each 
compare-and-swap, which minimizes the time frame in which a wrap-around may occur. Ten bits, together with 
this short time frame, are more than sufficient to make this safe in practice. 
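
The pointer stamping can be sketched as follows. The helper names stamp, pointer_of and stamp_of and the reduced BlockArray type are assumptions made for illustration; only the 2048-byte alignment and the ten stolen bits are taken from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Constants from the text: 2048-byte alignment, ten stolen bits.
constexpr std::uintptr_t kAlignment = 2048;
constexpr std::uintptr_t kStampMask = (std::uintptr_t{1} << 10) - 1;

// Reduced stand-in for the real BlockArray.
struct alignas(2048) BlockArray {
    std::uint64_t version = 0;  // full version number, used for verification
};

// Combines the pointer with the ten low bits of the version number.
inline std::uintptr_t stamp(const BlockArray* p, std::uint64_t version) {
    auto raw = reinterpret_cast<std::uintptr_t>(p);
    assert((raw & (kAlignment - 1)) == 0);  // low bits must be free
    return raw | (static_cast<std::uintptr_t>(version) & kStampMask);
}

inline BlockArray* pointer_of(std::uintptr_t stamped) {
    return reinterpret_cast<BlockArray*>(stamped & ~kStampMask);
}

inline std::uintptr_t stamp_of(std::uintptr_t stamped) {
    return stamped & kStampMask;
}
```

Two version numbers that differ by a multiple of 1024 yield the same stamped word, which is why the full version number is verified separately before the compare-and-swap.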

4.5 Extensions 

The decrease-key operation is often an important operation on priority queues, but is hard to implement 
in parallel. It can be worked around in a non-linearizable way by deleting a key and reinserting it with 
its new value. But even deletions can be hard to accomplish, since this requires either that keys are unique 
(a requirement we do not place on our priority queue), or that a unique id or pointer to each item is 
maintained and exposed to applications. In addition, not all types of priority queues are suitable for efficient random 
deletions of keys. We resolve this issue by a lazy deletion scheme similar to the one presented by Wimmer et 
al. [29]: the priority queue can query whether an item needs to be deleted. This can be performed whenever 
it is convenient for the priority queue, which for the LSM is whenever items are copied into a new block 
(deleted items do not need to be copied) or when checking whether a block can be shrunk (here deletion 
happens for items at the end of a block). We use this lazy deletion technique in the implementation of our 
SSSP benchmark application in Section 6. 
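
As an illustration of dropping items while copying, the sketch below merges two sorted blocks and skips every item that the application reports as obsolete. The function merge_live, the Item layout and the predicate is_obsolete are hypothetical; blocks are assumed to be sorted by ascending key here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct Item { int key; int id; };

// Merges two blocks sorted by ascending key and drops every item for which
// the application-supplied predicate reports that it is obsolete.
inline std::vector<Item> merge_live(
        const std::vector<Item>& a, const std::vector<Item>& b,
        const std::function<bool(const Item&)>& is_obsolete) {
    std::vector<Item> out;
    out.reserve(a.size() + b.size());
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        // Take from a if b is exhausted or a holds the smaller (or equal) key.
        bool take_a = j == b.size() || (i < a.size() && a[i].key <= b[j].key);
        const Item& next = take_a ? a[i++] : b[j++];
        if (!is_obsolete(next)) out.push_back(next);  // lazy deletion
    }
    return out;
}
```

The deletion costs nothing beyond the predicate call, since every surviving item has to be copied during the merge anyway.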

Another common operation is meld, where two priority queues are merged into a single one. This is also 
easy to provide on LSM data structures, since merging lies at the heart of the LSM idea. However, it is hard 
to provide this in a linearizable fashion. We leave this problem for future work. 


5 Correctness 

In this section we prove the linearizability of the operations on our data structure according to structurally 
p-relaxed and local ordering semantics. In addition, we show that our priority queue fulfils the lock-free 
progress guarantee. 

Lemma 1 The insert operation is linearizable. 

PROOF: An insert operation that leads to a Block being published in the shared k-LSM priority queue 
is linearized at the linearization point of push_snapshot, which is the compare-and-swap on shared. An 
insert operation that inserts the Block with the inserted item into the distributed LSM priority queue is 
linearized when the block (which is potentially a result of multiple merges) is written into the array blocks. 
In both cases, the inserted key becomes reachable for all threads at the linearization point, and stays 
reachable at least until it is marked as deleted. □ 


Lemma 2 The try_delete_min operation is linearizable with structural p-relaxation semantics, where p = 
Tk, T being the number of participating threads and k the relaxation configuration parameter. It also fulfils 
local ordering semantics. 

PROOF: A successful try_delete_min is linearized at the point when the private snapshot of shared is 
verified by comparing shared with observed for the last time. Afterwards, both the thread-local DistLSM 
and the private snapshot of shared are treated as snapshots and the structurally p-relaxed minimum is 
searched for. Any inconsistencies would lead to failures and thus repetitions, including another comparison 
of the two arrays. 

If an item is successfully marked as deleted it is guaranteed to have one of the p = Tk smallest keys at 
the time the snapshot of shared was taken. At most (T - 1)k keys from the thread-local DistLSMs of other 
threads can be skipped, since the size of a DistLSM is bounded by k. Another k keys can be ignored due to 
the uniformly random selection out of at most k + 1 keys in SharedKLSM::find_min. 

An unsuccessful try_delete_min (which includes spurious failures) is linearized at the point when shared 
is found to be a null-pointer after the thread-local DistLSM has been found empty. At this point at most 
(T - 1)k keys can be skipped, which is the maximum number of keys that can be stored in total in the 
thread-local DistLSMs of other threads. Since spy is allowed to spuriously fail it does not contribute to the 
linearization point of a failure even though it happens after shared is found to be a null-pointer. 

Local ordering semantics are fulfilled, since the thread-local DistLSM is always checked for the minimal 
key, and also all blocks in SharedKLSM, which might contain items added by the given thread. □ 
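
The two contributions to the number of skipped keys in the proof add up as follows; the numeric instance uses T = 64 and k = 256 purely as an example.

```latex
\underbrace{(T-1)\,k}_{\text{DistLSMs of other threads}}
\;+\;
\underbrace{k}_{\text{random choice in the shared } k\text{-LSM}}
\;=\; Tk \;=\; p,
\qquad\text{e.g.}\quad 63 \cdot 256 + 256 = 16384 .
```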


Lemma 3 The insert operation is lock-free. 

PROOF: An insert that only operates on the thread-local DistLSM is wait-free, since no other thread can 
block progress. An insertion on the shared k-priority queue may fail to update shared so that the insertion 
has to be restarted with a new snapshot, but only if another thread succeeded in modifying it. Since at 
least one thread has to make progress for another thread to fail updating the shared array, the algorithm is 
lock-free. □ 
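
The retry pattern behind this argument can be sketched as below. This is not our push_snapshot: the Snapshot type and the function publish are hypothetical, keys are simply appended instead of merged, and replaced snapshots are leaked on purpose because safe reclamation is a separate problem (Section 4.4).

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical immutable snapshot; the real structure is an array of blocks.
struct Snapshot { std::vector<int> keys; };

// Publishes key by replacing the shared snapshot and returns the number of
// failed attempts. Replaced snapshots are leaked (no reclamation scheme).
inline int publish(std::atomic<const Snapshot*>& shared, int key) {
    int failures = 0;
    const Snapshot* observed = shared.load(std::memory_order_acquire);
    for (;;) {
        auto* next = new Snapshot;
        if (observed != nullptr) next->keys = observed->keys;
        next->keys.push_back(key);
        // On failure, observed is overwritten with the value that another
        // thread installed, i.e. that thread has made progress.
        if (shared.compare_exchange_strong(observed, next,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire))
            return failures;
        delete next;  // never published, so no other thread can see it
        ++failures;
    }
}
```

Every failed iteration is caused by a successful update of another thread, which is exactly the lock-free guarantee.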


Lemma 4 The delete_min operation is lock-free. 

PROOF: The find_min methods of both priority queues are wait-free, since they operate on thread-local 
snapshots that are not influenced by the operations of other threads. 

While the delete_min operation might need more than one attempt to delete an item even in a quiescent 
state, due to logically deleted items not being physically removed, subsequent clean-up operations on such 
failures ensure that an active item is eventually found, or that the priority queue is found to be empty. Other threads may 
increase the number of iterations until success by deleting additional items, but this means that another 
thread made progress, and thus delete_min is lock-free. □ 


6 Evaluation 

We now compare our lock-free k-priority queue to the recent, relaxed lock-free priority queues by Alistarh et 
al. [2] and Wimmer et al. [29], and the randomized lock-based Multi-Queues by Rihani et al. [21]. We show 
that our data structure has high absolute performance, due to its cache-efficient layout and small amount of 
synchronization operations, and can provide good scalability. 

Since we focus on priority queues with relaxed semantics we cannot show a full comparison to current 
state-of-the-art, non-relaxed priority queues. Instead, we picked the lock-free priority queue by Linden and 
Jonsson [14] as a representative of such priority queues. We note again that scalability of exact priority 
queues is limited. While it is often possible to achieve good scalability on insertions mm, deleting the 
minimal key is inherently sequential. So for deletions, a perfectly scalable exact priority queue can at best 
keep the absolute throughput constant with the number of threads. 

For comparison with the priority queue by Linden and Jonsson, as well as the SprayList and Multi-Queues 
we rely on a throughput benchmark, which lets all threads randomly insert and delete keys from a priority 
queue that is prefilled with a given number of keys. This benchmark creates extremely high contention on 
the priority queue and thus gives good insight into the practical scalability of a priority queue. The same 
benchmark was also used by Alistarh et al. [2] and Rihani et al. [21] to evaluate their data structures and 
thus facilitates direct comparability. In the experiments the ratio between insertions and deletions is 50-50. 

The data structures by Wimmer et al. [29] are part of a task scheduling system, and cannot be used 
as standalone data structures. This makes it impossible to do a fair comparison with these data structures 
with the throughput benchmark. Instead, we adapted the SSSP benchmark of [29] to evaluate their priority 
queues. This benchmark is a label-correcting version of Dijkstra’s single-source shortest path algorithm, 
which is parallelized in a straightforward manner using a concurrent priority queue. It uses a lazy deletion 
scheme in connection with reinsertion of keys instead of an explicit decrease-key operation. We adapted our 
priority queue to support this lazy deletion scheme as described in Section 4.5. Since this extension has a 
significant impact on performance, but is not available in the other priority queues, and since there was no 
easy way to directly delete keys from these priority queues with the available implementations either, we 
decided not to include these data structures in the SSSP benchmark. 
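
For readers unfamiliar with this formulation, the sketch below shows a sequential label-correcting SSSP that reinserts nodes instead of decreasing keys and discards stale entries when they are removed. It is not the benchmark code: it uses std::priority_queue in place of a concurrent priority queue, and the names Edge, Graph and sssp are chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Edge { std::size_t to; std::uint64_t weight; };
using Graph = std::vector<std::vector<Edge>>;

inline std::vector<std::uint64_t> sssp(const Graph& g, std::size_t source) {
    const std::uint64_t inf = std::numeric_limits<std::uint64_t>::max();
    std::vector<std::uint64_t> dist(g.size(), inf);
    using Entry = std::pair<std::uint64_t, std::size_t>;  // (distance, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    dist[source] = 0;
    pq.push({0, source});
    while (!pq.empty()) {
        Entry top = pq.top();
        pq.pop();
        std::uint64_t d = top.first;
        std::size_t u = top.second;
        if (d > dist[u]) continue;  // stale entry: lazily deleted here
        for (const Edge& e : g[u]) {
            if (d + e.weight < dist[e.to]) {
                dist[e.to] = d + e.weight;
                pq.push({dist[e.to], e.to});  // reinsert, no decrease-key
            }
        }
    }
    return dist;
}
```

With a relaxed priority queue the stale-entry check stays the same, but nodes may be settled out of order and corrected later, which is the source of the additional iterations reported in Section 6.1.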

Environment We have implemented our data structure in C++. We relaxed the memory consistency 
requirements of synchronization operations wherever possible to reduce the synchronization cost. We used manual 
memory management for our priority queue as described in Section 4.4. Our priority queue implementation, 
as well as the benchmarks used in this paper, are available for download as part of the Pheet framework.* 

All experiments were performed on an 80-core Intel Xeon based system comprised of eight Intel Xeon 
E7-8850 processors of 10 cores each. The system has 1TB of main memory. The experiments were run under 
Debian Linux and were compiled using gcc 4.9.1. We verified our findings using a 48-core AMD Opteron 
based machine. Due to space considerations, and since the Intel system is closer to the machines used by 
Alistarh et al., Sanders et al. and Wimmer et al., we give results only for the Intel system. 

Methodology The throughput benchmark was run for 10 seconds for each experiment, and the average 
throughput per second is shown (for a 50-50 mix of insertions and deletions). SSSP was run on Erdős-Rényi 
random graphs with 10000 nodes and edge probability 50%; edge weights are randomly chosen integers in 
the range [1,100000000]. All experiments were repeated 30 times. Mean values with confidence intervals 
are reported in the plots. 
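
A graph of this kind could be generated as in the following sketch. It is not the generator used for the experiments: the function random_graph, the undirected representation and the choice of std::mt19937_64 are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

struct Edge { std::uint32_t to; std::uint32_t weight; };
using Graph = std::vector<std::vector<Edge>>;

// Erdős-Rényi graph: every pair of nodes is connected independently with
// the given probability; weights are uniform integers in [1, max_weight].
// The graph is assumed to be undirected here.
inline Graph random_graph(std::uint32_t nodes, double edge_probability,
                          std::uint32_t max_weight, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::bernoulli_distribution has_edge(edge_probability);
    std::uniform_int_distribution<std::uint32_t> weight(1, max_weight);
    Graph g(nodes);
    for (std::uint32_t u = 0; u < nodes; ++u) {
        for (std::uint32_t v = u + 1; v < nodes; ++v) {
            if (has_edge(rng)) {
                std::uint32_t w = weight(rng);
                g[u].push_back({v, w});
                g[v].push_back({u, w});
            }
        }
    }
    return g;
}
```

The setting described above would correspond to a call such as random_graph(10000, 0.5, 100000000, seed) with an arbitrary seed.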

6.1 Results 

Results for the throughput benchmark can be seen in Figure 3. The benchmarked implementations are a 
binary heap protected by a spin-lock, the skiplist-based priority queue by Linden and Jonsson, the SprayList, 
the Multi-Queue by Rihani et al., our priority queue with different values for k, and the distributed LSM 
priority queue (DLSM), which is our priority queue without p-relaxation guarantees. 

We first observe that in a sequential setting, the performance of the DLSM is close to the binary heap, 
and that, due to the local ordering semantics, it also provides the same guarantees. Our priority queue with k = 0 
is significantly slower due to the high overhead of maintaining the shared k-LSM priority queue. We note 
that the Linden and Jonsson priority queue is slightly slower than k-LSM with k = 0 due to the worse 
cache-efficiency of skip-lists and the higher number of required synchronization operations per insertion and 
deletion. 

* www.pheet.org 

Figure 3: Throughput per thread per second for priority queues prefilled with 10^6 (left) and 10^7 (right) 
elements. We use throughput per thread for better readability, and thus a constant throughput in the plot 
corresponds to a linear speedup. 

The Linden and Jonsson priority queue scales well for a non-relaxed priority queue and can outperform 
the binary heap with the global lock, as well as our priority queue with k = 0. The performance and 
scalability of our data structure improve drastically with higher k, making it easy to beat non-relaxed 
priority queues. Our DLSM even scales superlinearly, due to the smaller number of keys stored per thread 
as the number of threads increases. 

The experiments for the SprayList were run with the default settings of the implementation. We had to 
increase the number of allocators in the custom allocator to 10, otherwise the algorithm ran out of memory. 
We were only able to run the benchmark for one second (instead of ten), again due to the algorithm running 
out of memory. The SprayList has better scalability than the Linden and Jonsson priority queue. While it 
seems to scale better than our priority queue with k = 256, its absolute performance is worse, except for the 
case of 10^7 elements and 80 threads. 

As for the relaxation, the SprayList will return one of the O(T log^3 T) smallest keys w.h.p., as opposed 
to our priority queue, which will return one of the Tk smallest keys on delete_min. The exact constant 
factors of the SprayList are not documented, making a direct comparison with our priority queue difficult. 
To give an example, for 64 threads, the SprayList will have p = 13824c for some constant c > 1, whereas 
our priority queue has p = 64k for a configurable k (e.g. for k = 256, p = 16384). Note that for k-LSM 
p is the worst-case bound. For SprayList the worst-case cannot be bounded, since the SprayList can be 
arbitrarily modified while the delete_min operation traverses the list. The SprayList also does not provide 
local ordering semantics. 
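
The arithmetic of the 64-thread example can be reproduced with the small sketch below. The function names are hypothetical, the logarithm is assumed to be taken to base 2 (which matches the value 13824 = 64 * 6^3), and the unknown constant c of the SprayList is left out.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case relaxation bound of the k-LSM for T threads: p = T * k.
constexpr std::uint64_t klsm_bound(std::uint64_t threads, std::uint64_t k) {
    return threads * k;
}

// T * log2(T)^3, the SprayList expression without its unknown constant c.
// Restricted to powers of two so that the logarithm is an integer.
inline std::uint64_t spraylist_term(std::uint64_t threads) {
    assert(threads > 0 && (threads & (threads - 1)) == 0);
    std::uint64_t log2 = 0;
    for (std::uint64_t t = threads; t > 1; t >>= 1) ++log2;
    return threads * log2 * log2 * log2;
}
```

For 64 threads, spraylist_term yields 13824 and klsm_bound with k = 256 yields 16384, so the two bounds are of similar magnitude only if c is close to 1.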

The Multi-Queue experiments were run with the Boost† 8-ary heap enabled and a setting of c = 2, which 
means that 2T priority queues are used. According to the inventors of the Multi-Queue, the expected quality 
should roughly correspond to the k-LSM priority queue with k = 4. The Multi-Queue exhibits very good 
performance and scalability. Unfortunately no worst-case bounds on the quality can be given for the 
Multi-Queue's delete-min operation, since a stalling thread can block delete-min access to an arbitrary number of 
keys. 

Comparing the results for different prefill values, we observe that our implementation performs better 
for a fixed k when fewer elements are stored in the priority queue. This comes from the fact that a larger 
number of items in the priority queue increases the probability that the selected minimal key in the shared 
k-LSM is smaller than the minimal key out of the thread-local DistLSM for this benchmark. Thus the limited 
scalability of the shared k-LSM priority queue becomes more dominant. 

Results for the SSSP benchmark are shown in Figure 4. We compare our priority queue (k-LSM) against 
the centralized k-priority queue and hybrid k-priority queues by Wimmer et al. [29]. What can be seen here 
is that while the algorithm used by the benchmark seems to have limited scalability so that it does not scale 
to more than 10 threads, our priority queue provides stable performance while the other priority queues 
experience a slowdown with more threads. 

† www.boost.org 

Figure 4: Execution times for SSSP benchmark for varying numbers of threads (k = 256) and values for k 
(10 threads). 

We also explore the influence of the relaxation parameter k on the performance of the algorithm. For 
this we fixed the number of threads to 10 and varied the parameter k. For the centralized k-priority queue 
no visible difference between different values for k can be seen, making it the best priority queue for k = 0, 
but not the best of all. For the other priority queues there exists a best value, where the additional work 
that needs to be performed due to the relaxation does not outweigh the scalability gains. For the hybrid 
k-priority queue this value is somewhere around k = 4096, where on average 305 additional iterations needed 
to be performed compared to a sequential execution (not shown in graphs). For our priority queue we found 
such a best value at k = 256 (+362 iterations), but also had good results with k = 16384 (+3965 iterations). 

7 Conclusion 

We presented a novel, relaxed concurrent priority queue with lock-free progress guarantees. In a sequential 
setting, its performance is comparable to a binary heap, and it exhibits very good scalability for a reasonably 
large value of the relaxation parameter k. Performance- and scalability-wise it is competitive with other recent 
lock-free relaxed priority queues. In particular, it seems superior to the SprayList [2] in most settings, and 
has the advantage of providing fixed and easy to calculate relaxation guarantees and local ordering semantics. 
The sensitivity of our priority queue to the number of queued elements makes the SprayList the better choice 
for settings with a huge number of elements and high contention. In comparison with the priority queues by 
Wimmer et al. [29], we provide both better scalability and absolute performance for the SSSP application. 

The distributed LSM priority queue has high scalability, except for spying for which the cost increases 
with the number of items that are spied. We intend to resolve this by allowing spy operations to copy 
read-only references to blocks. However, the main scalability bottleneck of our priority queue is the shared 
k-LSM priority queue. We see lots of potential for improvement there, utilizing techniques like lazy merging 
and helping schemes. 

The appealing Multi-Queue by Rihani et al. [21] seems to provide better trade-offs between expected 
quality and scalability, but cannot provide any worst-case guarantees. We hope to collaborate with Rihani 
et al. to devise a relaxed priority queue that combines the best of the two ideas. 

Finally, an interesting area to explore is priority queues with elimination and batching as presented by 
Calciu et al. [8]. We believe that the LSM priority queue is well-suited for batching techniques and it might 
be possible to achieve scalability for our data structure using these techniques instead of relaxation. 